This assignment is based on the material covered in James et al. We will subsequently open up solutions to the problem sets.
This question involves the use of multiple linear regression on the Auto data set. So load the data set from the ISLR package first.
If the following code chunk returns an error, you most likely have to install the ISLR package first. Use install.packages("ISLR") if this is the case.
# install.packages("ISLR")
data("Auto", package = "ISLR")
#
# !! `auto <- try(data(Auto, package="ISLR")` doesn't work. Can't assign a dataset to a variable.
Dataset description: Gas mileage, horsepower, and other information for 392 vehicles. > A data frame with 392 observations on the following 9 variables. Fields available: - mpg: miles per gallon - cylinders: Number of cylinders between 4 and 8 - displacement Engine displacement (cu. inches) - horsepower Engine horsepower - weight: Vehicle weight (lbs.) - acceleration: Time to accelerate from 0 to 60 mph (sec.) - year Model year (modulo 100) - origin: Origin of car (1. American, 2. European, 3. Japanese) - name Vehicle name
par(mfrow=c(4,3))
pairs(Auto[1:3])
pairs(Auto[4:6])
pairs(Auto[7:9])
plot(Auto)
cor(). You will need to exclude the name variable, which is qualitative.library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# can subset whith col number
print(which(colnames(Auto)=="name") )
## [1] 9
Auto1 <- Auto[c(-9)]
# can subset with names
Auto2 <- Auto[c(-which(colnames(Auto)=="name"))]
# can subset with subset
Auto3 <- subset(Auto, select = !colnames(Auto) %in% "name")
# fail to subset with filter
# `Auto4 <- Auto %>% dplyr::filter(UQ(!!as.name("name")=="name")))`
# can subset with select_if
Auto4 <- Auto %>% select_if(is.numeric)
# check if 2 dfs are equal
identical(Auto1, Auto2)
## [1] TRUE
# another method to check if 2 dfs are equal
isTRUE(all.equal(Auto2, Auto3))
## [1] TRUE
all.equal(Auto2, Auto3)
## [1] TRUE
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
year variable suggest?# another way to reg on all var --- `myregreg <- with(Auto1, lm(mpg~., data = Auto1))`
myreg = lm(mpg ~ ., data=Auto1)
summary(myreg)
##
## Call:
## lm(formula = mpg ~ ., data = Auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?# do the plot
plot(myreg)
par(mfrow=c(2,2))
with(Auto1, plot(cylinders, mpg))
abline(lm(mpg~cylinders, Auto1))
with(Auto1, plot(displacement, mpg))
abline(lm(mpg~displacement, Auto1), col="red")
with(Auto1, plot(weight, mpg))
abline(lm(mpg~weight, Auto1), col="red")
# intercept: -17.218435
# slope for acceleration : 0.080576
with(Auto1, plot(acceleration, mpg))
abline(lm(mpg~acceleration, Auto1))
# this snippet shows that the lines for SLR and MLR on the same variables are not the same.
par(mfrow=c(2,2))
with(Auto1, plot(acceleration, mpg))
abline(lm(mpg~acceleration, Auto1))
coefficients(myreg)
## (Intercept) cylinders displacement horsepower weight
## -17.218434622 -0.493376319 0.019895644 -0.016951144 -0.006474043
## acceleration year origin
## 0.080575838 0.750772678 1.426140495
with(Auto1, plot(acceleration, mpg))
abline(as.numeric(myreg$coefficients[1]), as.numeric(myreg$coefficients[6]), col="blue")
abline(lm(mpg ~ acceleration, Auto1), col="red")
colnames(Auto1)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
par(mfrow=c(2,1))
with(Auto1, plot(year, mpg))
abline(lm(mpg~year, Auto1))
summary(lm(mpg~year, Auto1))
##
## Call:
## lm(formula = mpg ~ year, data = Auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0212 -5.4411 -0.4412 4.9739 18.2088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -70.01167 6.64516 -10.54 <2e-16 ***
## year 1.23004 0.08736 14.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.363 on 390 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.3353
## F-statistic: 198.3 on 1 and 390 DF, p-value: < 2.2e-16
with(Auto1, plot(origin, mpg))
abline(lm(mpg~origin, Auto1))
(e) Use the
* and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? > NT: Fields available: - mpg: miles per gallon - cylinders: Number of cylinders between 4 and 8 - displacement: Engine displacement (cu. inches) - horsepower: Engine horsepower - weight: Vehicle weight (lbs.) - acceleration: Time to accelerate from 0 to 60 mph (sec.) - year: Model year (modulo 100) - origin: Origin of car (1. American, 2. European, 3. Japanese) - name: Vehicle name
horse_acceleration <- Auto1$horsepower * Auto1$acceleration
plot(Auto1$acceleration, Auto1$horsepower)
abline(lm(horsepower~acceleration, Auto1), col="red")
sprintf("It seems that horsepower and accelaration has high negative correlation of %f, with a fitted prodictor value of %f ",cor(Auto1$horsepower, Auto1$acceleration), coef(lm(horsepower~acceleration, Auto1))[2])
## [1] "It seems that horsepower and accelaration has high negative correlation of -0.689196, with a fitted prodictor value of -9.615528 "
Autoplus <- Auto1 %>% dplyr::mutate(horsepower * acceleration)
# go, go ahead with regressing on that
lm.interact <-lm(mpg ~ ., data=Autoplus)
summary(lm.interact)
##
## Call:
## lm(formula = mpg ~ ., data = Autoplus)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0329 -1.8177 -0.1183 1.7247 12.4870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.499820 4.923380 -6.601 1.36e-10 ***
## cylinders 0.083489 0.316913 0.263 0.792350
## displacement -0.007649 0.008161 -0.937 0.349244
## horsepower 0.127188 0.024746 5.140 4.40e-07 ***
## weight -0.003976 0.000716 -5.552 5.27e-08 ***
## acceleration 0.983282 0.161513 6.088 2.78e-09 ***
## year 0.755919 0.048179 15.690 < 2e-16 ***
## origin 1.035733 0.268962 3.851 0.000138 ***
## `horsepower * acceleration` -0.012139 0.001772 -6.851 2.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.145 on 383 degrees of freedom
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8376
## F-statistic: 253.2 on 8 and 383 DF, p-value: < 2.2e-16
I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from 0.8182238 to 0.8376468
sprintf("I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from %f to %f", summary(myreg)$adj.r.squared, summary(lm.interact)$adj.r.squared)
## [1] "I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from 0.818224 to 0.837647"
lm.horse <- lm(mpg~horsepower, data=Auto1)
summary(lm.horse)$r.squared
## [1] 0.6059483
lm.log_horse <- lm(mpg~log(horsepower), data=Auto1)
summary(lm.log_horse)$r.squared
## [1] 0.6683348
par(mfrow = c(2,2))
plot(Auto1$horsepower, Auto1$mpg)
plot(log(Auto1$horsepower), Auto1$mpg)
This question should be answered using the Carseats dataset from the ISLR package. So load the data set from the ISLR package first.
data("Carseats", package = "ISLR")
# colnames(select_if(Carseats,is.numeric))
# [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
# [6] "Price" "Age" "Education"
Sales using Price, Urban, and US.attach(Carseats)
lm.1 <- lm(Sales ~ Price + Urban + US)
summary(lm.1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
# plot something
plot(Sales, fitted.values(lm(Sales~Price)))
abline(lm(Sales~Price), col="blue")
abline(lm(Sales~residuals(lm(Sales~Price))), col="red")
# data satisfies homogeneity assumption
plot(residuals(lm(Sales~Price)))
is.numeric(c(US, Price, Sales, Urban))
## [1] TRUE
# import script to use my own function
source("myfunctions.R")
sapply(c(1:length(colnames(lm.1$model))), describe_coef, regression=lm.1)
## [1] "Intercept of model is 13.043469"
## [2] "Coefficient 'Price' has a slope of -0.054459. It is a character variable. "
## [3] "Coefficient 'Urban' has a slope of -0.021916. It is a character variable. "
## [4] "Coefficient 'US' has a slope of 1.200573. It is a character variable. "
For every 1 dollar increase in price, given the same car type (urban or not) and car origing (U.S. made or not), sales decreases by 0.05 units. Sales for urban cars is less than the non-urban ones by 0.02 units holding car type and car price constant. In addition, U.S. made cars sell better, with a sales performance of 1.2 units more than the non-U.S. made models.
Write out the model in equation form, being careful to handle the qualitative variables properly. \[ \widehat{Sales} = \widehat{13.043469} + \widehat{-0.054459}*Price + \widehat{-0.021916}*Urban + \widehat{1.200573}*US + \epsilon\]
Urban has a t-statistic of -0.081 and a p-value of 0.936. It is statistically highly insignificant. I can reject the null hypothesis \(H_0 : \beta_Urban = 0\) at very low confidence interval. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.small <- lm(Sales ~ Price + US)
summary(lm.small)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
sprintf("R-squared for this model is %s, the model can explain %s percent of variation in Sales. ", summary(lm.small)$r.squared, round(x = summary(lm.small)$r.squared*100, digits = 2))
## [1] "R-squared for this model is 0.239262888426786, the model can explain 23.93 percent of variation in Sales. "
confint(object = lm.small, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
From the data, it can be said that the true effect of price on car sales will be within -0.0648 and -0.0442, 95% of the time, if the linear regression is performed on an infinite number of sample data. The true effect of the car being U.S. made is within 0.6915 and 1.7078 given the same conditions.
par(mfrow=c(2,2))
plot(Price, Sales)
plot(US, Sales, xlab = "US", ylab="Sales")
The graphs provide graphical evidence that there are outliers in both Price and US variables.
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
rnorm() function, create a vector, x, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, X.set.seed(1)
X <- rnorm(100, mean = 0, sd = 1)
rnorm() function, create a vector, eps, containing 100 observations drawn from a \(N(0,0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.set.seed(1)
eps <- rnorm(100, mean = 0, sd = 0.25)
x and eps, generate a vector y according to the model \[Y = -1 + 0.5X + \epsilon\] What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?Y = -1 + 0.5 * X + eps
hist(Y)
x and y. Comment on what you observe.plot(X, Y)
abline(lm(Y~X))
sprintf("Y and X does not seem to be a strong relationship. In addition, it has a weak correlation of %f. ", cor(X, Y))
## [1] "Y and X does not seem to be a strong relationship. In addition, it has a weak correlation of 1.000000. "
y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?lm.2 <- lm(Y~X)
sprintf("The model has an r-squared of %f, with a F-statistic of %f", summary(lm.2)$r.squared , summary(lm.2)$fstatistic[1])
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## [1] "The model has an r-squared of 1.000000, with a F-statistic of 648726956315797987188980599226368.000000"
sprintf("The predictor X is not statistically significant with a t-statistic of only 0.237 and a p-value of 0.813. We can be highly confident that the predictor for X, $/beta_0 = 0$")
## [1] "The predictor X is not statistically significant with a t-statistic of only 0.237 and a p-value of 0.813. We can be highly confident that the predictor for X, $/beta_0 = 0$"
legend() command to create an appropriate legend.plot(Y, X)
abline(lm(Y~X), col="blue")
abline(Y,X, col="red")
# TODO add legend: `legend(0.2,0.3, "hello" )`
# TODO how to plot the population regression line?
X2 <- X^2
lm.3 <- lm(Y~X + X2)
sprintf("The new r-squared and adjusted r-squared is %f and %f respectively. This is greater than the old rsquared with just a degree-1 linear fit of %f and %f. ", summary(lm.3)$r.squared, summary(lm.3)$adj.r.squared, summary(lm.2)$r.squared, summary(lm.2)$adj.r.squared)
## Warning in summary.lm(lm.3): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.3): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## [1] "The new r-squared and adjusted r-squared is 1.000000 and 1.000000 respectively. This is greater than the old rsquared with just a degree-1 linear fit of 1.000000 and 1.000000. "
# can adj.r.squared be negative???
plot(X, Y)
abline(lm.2, col="red")
abline(lm.3, col="blue")
## Warning in abline(lm.3, col = "blue"): only using the first two of 3
## regression coefficients
set.seed(1)
eps_lessnoise <- rnorm(100, mean = 0, sd = 0.2)
Y_lessnoise <- -1 + 0.5*X + eps_lessnoise
lm.4 <- lm(Y_lessnoise~X)
lm.5 <- lm(Y_lessnoise ~ X + X2)
sprintf("After having reduced the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is %f and %f respectively. Mean while, the polynomial fit has an rsquared of %f and adjusted r-squared of %f. Therefore, linear fit looks better for this simulated data.", summary(lm.4)$r.squared, summary(lm.4)$adj.r.squared, summary(lm.5)$r.squared, summary(lm.5)$adj.r.squared)
## Warning in summary.lm(lm.4): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.4): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.5): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.5): essentially perfect fit: summary may be
## unreliable
## [1] "After having reduced the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is 1.000000 and 1.000000 respectively. Mean while, the polynomial fit has an rsquared of 1.000000 and adjusted r-squared of 1.000000. Therefore, linear fit looks better for this simulated data."
plot(X, Y_lessnoise)
abline(lm.4, col="red")
print("Graphically, this looks like a linear fit. ")
## [1] "Graphically, this looks like a linear fit. "
set.seed(1)
eps_morenoise <- rnorm(100, mean = 0, sd = 5)
Y_morenoise <- -1 + 0.5*X + eps_morenoise
lm.6 <- lm(Y_morenoise~X)
lm.7 <- lm(Y_morenoise ~ X + X2)
sprintf("After having increased variation in the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is %f and %f respectively. Mean while, the polynomial fit has an rsquared of %f and adjusted r-squared of %f. Therefore, polynomial fit looks better for this simulated data.", summary(lm.6)$r.squared, summary(lm.6)$adj.r.squared, summary(lm.7)$r.squared, summary(lm.7)$adj.r.squared)
## Warning in summary.lm(lm.6): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.6): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.7): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.7): essentially perfect fit: summary may be
## unreliable
## [1] "After having increased variation in the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is 1.000000 and 1.000000 respectively. Mean while, the polynomial fit has an rsquared of 1.000000 and adjusted r-squared of 1.000000. Therefore, polynomial fit looks better for this simulated data."
# TODO adjusted r-squared is again negative.
confint(lm.2)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## 2.5 % 97.5 %
## (Intercept) -1.00 -1.00
## X 0.75 0.75
confint(lm.4)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## 2.5 % 97.5 %
## (Intercept) -1.0 -1.0
## X 0.7 0.7
confint(lm.6)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## 2.5 % 97.5 %
## (Intercept) -1.0 -1.0
## X 5.5 5.5
sprintf("The range of confident interval for X in ORIGINAL model is %f ", range_of_confint(lm.4))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in ORIGINAL model is 1.700000 "
sprintf("The range of confident interval for X in the NOISIER model is %f ", range_of_confint(lm.2))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in the NOISIER model is 1.750000 "
sprintf("The range of confident interval for X in the LESS NOISY model is %f ", range_of_confint(lm.6))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in the LESS NOISY model is 6.500000 "